For the following exercises, we will again use data from Gapminder, this time on life expectancy.
As usual, we first need to read in the data. You can copy the following code into your script and run it.
library(readr)
gap_life <- read_csv("../data/gapminder/life_expectancy_years.csv")
Again, the data are currently in wide format.
Select only the columns for the years 1900 to 1999 with the selection helper starts_with(). We also want to keep the country column.
library(dplyr)
gap_life %>%
  select(country, starts_with("19"))
## # A tibble: 187 x 101
## country `1900` `1901` `1902` `1903` `1904` `1905` `1906` `1907` `1908`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghan~ 29.2 29.3 29.3 29.4 29.4 29.5 29.6 29.6 29.7
## 2 Albania 35.5 35.5 35.5 35.5 35.5 35.5 35.5 35.5 35.5
## 3 Algeria 30.1 30.2 30.3 31.3 25.3 28 29.5 29.4 29.3
## 4 Andorra NA NA NA NA NA NA NA NA NA
## 5 Angola 29.5 29.6 29.7 29.8 29.9 30 30.1 30.1 30.2
## 6 Antigu~ 33.7 33.7 33.7 33.7 33.7 33.7 33.7 33.7 33.7
## 7 Argent~ 36.6 37.2 37.8 38.3 38.9 39.5 40.2 41 41.7
## 8 Armenia 35.2 35.4 35.6 35.8 36.1 36.3 36.5 36.7 36.9
## 9 Austra~ 50 50.5 51.1 51.6 52.1 52.7 53.2 53.7 54.3
## 10 Austria 41.5 42 41 40.1 40.7 41.3 42 42.6 43.2
## # ... with 177 more rows, and 91 more variables: `1909` <dbl>,
## # `1910` <dbl>, `1911` <dbl>, `1912` <dbl>, `1913` <dbl>, `1914` <dbl>,
## # `1915` <dbl>, `1916` <dbl>, `1917` <dbl>, `1918` <dbl>, `1919` <dbl>,
## # `1920` <dbl>, `1921` <dbl>, `1922` <dbl>, `1923` <dbl>, `1924` <dbl>,
## # `1925` <dbl>, `1926` <dbl>, `1927` <dbl>, `1928` <dbl>, `1929` <dbl>,
## # `1930` <dbl>, `1931` <dbl>, `1932` <dbl>, `1933` <dbl>, `1934` <dbl>,
## # `1935` <dbl>, `1936` <dbl>, `1937` <dbl>, `1938` <dbl>, `1939` <dbl>,
## # `1940` <dbl>, `1941` <dbl>, `1942` <dbl>, `1943` <dbl>, `1944` <dbl>,
## # `1945` <dbl>, `1946` <dbl>, `1947` <dbl>, `1948` <dbl>, `1949` <dbl>,
## # `1950` <dbl>, `1951` <dbl>, `1952` <dbl>, `1953` <dbl>, `1954` <dbl>,
## # `1955` <dbl>, `1956` <dbl>, `1957` <dbl>, `1958` <dbl>, `1959` <dbl>,
## # `1960` <dbl>, `1961` <dbl>, `1962` <dbl>, `1963` <dbl>, `1964` <dbl>,
## # `1965` <dbl>, `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>,
## # `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>,
## # `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>,
## # `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>,
## # `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>,
## # `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>,
## # `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>, `1999` <dbl>
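To see what starts_with() matches without loading the full file, here is a minimal sketch on made-up data. The tibble `toy_wide`, its column set, and its values are assumptions for illustration only and are not part of the Gapminder file.

```r
library(dplyr)

# Toy wide-format data; names and values are invented for illustration
toy_wide <- tibble(
  country = c("A", "B"),
  `1899` = c(30.0, 31.0),
  `1900` = c(30.5, 31.2),
  `1999` = c(70.1, 72.3),
  `2000` = c(70.4, 72.8)
)

# Keep country plus every column whose name begins with "19"
toy_1900s <- toy_wide %>%
  select(country, starts_with("19"))

names(toy_1900s)
```

The columns `1899` and `2000` are dropped because their names do not begin with "19"; the match is on the column name as text, not on the year as a number.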
As you may have already noticed, the dataset contains some missing data points. Before we start analyzing the data, we might want to know how many countries have complete data.
Hint: use the drop_na() function from tidyr.
library(tidyr)
gap_life %>%
  drop_na() %>%
  nrow()
## [1] 184
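Counting the complete rows does not tell us which countries were dropped. The following sketch, again on invented toy data (the tibble `toy` is an assumption), shows one possible way to list them using base R's complete.cases().

```r
library(dplyr)
library(tidyr)

# Toy data with missing values; names and values are invented
toy <- tibble(
  country = c("A", "B", "C"),
  `1900` = c(30.5, NA, 41.0),
  `1901` = c(30.7, 35.1, NA)
)

# Number of rows without any missing value
n_complete <- toy %>%
  drop_na() %>%
  nrow()

# Countries with at least one missing value
incomplete <- toy %>%
  filter(!complete.cases(.)) %>%
  pull(country)

n_complete
incomplete
```

Here drop_na() is called without arguments, so a row is removed as soon as any of its columns is NA.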
As in the previous set of data wrangling exercises, we now want to transform the data into the long format.
The year column should be of type integer; you can convert it with mutate().
gap_life <- gap_life %>%
  gather(-country, key = "year", value = "lifeExp") %>%
  mutate(year = as.integer(year))
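In current versions of tidyr, gather() is marked as superseded in favour of pivot_longer(). As a sketch of the equivalent call (assuming tidyr 1.0.0 or later; the tibble `toy_wide` is made up for illustration), the reshaping could look like this:

```r
library(dplyr)
library(tidyr)

# Toy wide-format data; names and values are invented
toy_wide <- tibble(
  country = c("A", "B"),
  `1900` = c(30.5, 31.2),
  `1901` = c(30.7, 31.4)
)

# Reshape every column except country into year/lifeExp pairs
toy_long <- toy_wide %>%
  pivot_longer(-country, names_to = "year", values_to = "lifeExp") %>%
  mutate(year = as.integer(year))

toy_long
```

Note that pivot_longer() keeps the rows of each country together, whereas gather() stacks the data year by year, so the row order differs from the output shown in this exercise even though the content is the same.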
Now let’s apply some of the advanced filtering options we discussed in the Data Wrangling - Part 2 session.
Create two new data frames that include different subsets of the gap_life data:
Data for all countries for the 1990s (name this one gap_life_1990s),
Data for all years but only for Germany (name this one gap_life_ger).
Hint: use a helper function from dplyr to create the first new data frame and a specific matching operator to create the second one.
gap_life_1990s <- gap_life %>%
  filter(between(year, 1990, 1999))
gap_life_1990s
## # A tibble: 1,870 x 3
## country year lifeExp
## <chr> <int> <dbl>
## 1 Afghanistan 1990 51.6
## 2 Albania 1990 73.9
## 3 Algeria 1990 71
## 4 Andorra 1990 79.5
## 5 Angola 1990 50.2
## 6 Antigua and Barbuda 1990 73.8
## 7 Argentina 1990 72.1
## 8 Armenia 1990 70.2
## 9 Australia 1990 77
## 10 Austria 1990 75.5
## # ... with 1,860 more rows
gap_life_ger <- gap_life %>%
  filter(country %in%
    c("Germany", "West Germany", "East Germany"))
gap_life_ger
## # A tibble: 219 x 3
## country year lifeExp
## <chr> <int> <dbl>
## 1 Germany 1800 38.4
## 2 Germany 1801 38.4
## 3 Germany 1802 38.4
## 4 Germany 1803 38.4
## 5 Germany 1804 38.4
## 6 Germany 1805 38.4
## 7 Germany 1806 38.4
## 8 Germany 1807 38.4
## 9 Germany 1808 38.4
## 10 Germany 1809 38.4
## # ... with 209 more rows
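The two filters above rely on between() and %in%. The sketch below, on an invented tibble `toy_long`, illustrates that between() includes both boundaries and that %in% simply ignores values in the lookup vector that do not occur in the data.

```r
library(dplyr)

# Toy long-format data; names and values are invented
toy_long <- tibble(
  country = c("Germany", "France", "Germany", "France"),
  year = c(1989L, 1990L, 1999L, 2000L),
  lifeExp = c(75.0, 77.0, 78.0, 79.0)
)

# between() is inclusive on both ends ...
via_between <- toy_long %>%
  filter(between(year, 1990, 1999))

# ... so it selects the same rows as two explicit comparisons
via_compare <- toy_long %>%
  filter(year >= 1990, year <= 1999)

# %in% keeps rows whose value appears anywhere in the vector
via_in <- toy_long %>%
  filter(country %in% c("Germany", "West Germany"))

via_between
via_in
```

This is presumably also why the gap_life_ger output only lists Germany: names in the lookup vector that are absent from the data match nothing and raise no error.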
For some comparisons (especially via plots), it might help to know which continent the country is located on. For this purpose, we will create a new continent variable. As it would be quite tedious to create this variable manually for all of the countries in the dataset, we will do this only for a subset in this exercise. Just run the following code in your local script to create this subset.
library(dplyr)
gap_life_subset <- gap_life %>%
  filter(country %in%
    c("Netherlands", "Brazil", "China", "Algeria", "New Zealand"))
Use case_when() to create this new variable.
gap_life_subset %>%
  mutate(continent = factor(case_when(
    country == "Algeria" ~ "Africa",
    country == "Brazil" ~ "Americas",
    country == "China" ~ "Asia",
    country == "Netherlands" ~ "Europe",
    country == "New Zealand" ~ "Oceania"
  )))
## # A tibble: 1,095 x 4
## country year lifeExp continent
## <chr> <int> <dbl> <fct>
## 1 Algeria 1800 28.8 Africa
## 2 Brazil 1800 32 Americas
## 3 China 1800 32 Asia
## 4 Netherlands 1800 39.9 Europe
## 5 New Zealand 1800 34 Oceania
## 6 Algeria 1801 28.8 Africa
## 7 Brazil 1801 32 Americas
## 8 China 1801 32 Asia
## 9 Netherlands 1801 39.9 Europe
## 10 New Zealand 1801 34 Oceania
## # ... with 1,085 more rows
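When case_when() is applied to data that contains countries not covered by any condition, those rows receive NA. A minimal sketch of adding a fallback branch follows; the tibble `toy`, the country "Atlantis", and the label "Unknown" are made up for illustration.

```r
library(dplyr)

# Toy data including a country that no condition covers (invented)
toy <- tibble(country = c("Algeria", "Brazil", "Atlantis"))

toy_cont <- toy %>%
  mutate(continent = case_when(
    country == "Algeria" ~ "Africa",
    country == "Brazil" ~ "Americas",
    # Fallback: TRUE matches every row not caught by an earlier condition
    TRUE ~ "Unknown"
  ))

toy_cont
```

Conditions are evaluated in order, so the catch-all branch has to come last.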